Univariate Plots Section
Variables
names(ww)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Change the variable name “density” to “mass.density”
To avoid confusing the variable “density” with the distribution “density” computed by ggplot, I rename the variable “density” to “mass.density”.
colnames(ww)[colnames(ww)=="density"] <- "mass.density"
Histogram plots and frequency polygons of variables
To explore the distribution of each individual variable, I will plot a histogram and a frequency polygon for every variable.
Function to create Histogram plots and frequency polygons
# Histogram (blue bars) with an overlaid frequency polygon (red line) for one
# column of ww. x_str is the column name; (xmin, xmax, dx) and (ymin, ymax, dy)
# set the axis limits and tick spacing.
plot_hist_fre_poly <- function(x_str, bin_width, xmin, xmax, dx, ymin, ymax, dy)
{
  ggplot(aes_string(x = x_str), data = ww) +
    geom_histogram(binwidth = bin_width, fill = "#3366FF") +
    scale_x_continuous(limits = c(xmin, xmax), breaks = seq(xmin, xmax, dx)) +
    scale_y_continuous(limits = c(ymin, ymax), breaks = seq(ymin, ymax, dy)) +
    ggtitle(" ") + geom_freqpoly(binwidth = bin_width, color = "red")
}
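`aes_string()` was soft-deprecated in ggplot2 3.0.0. A minimal sketch of the same helper written with the `.data` pronoun is given here. The data frame `toy` and the name `plot_hist_fre_poly2` are made up for illustration, and the axis limits are left out to keep the sketch short.

```r
library(ggplot2)

# Hypothetical stand-in for the wine data: one numeric column.
set.seed(1)
toy <- data.frame(fixed.acidity = rnorm(500, mean = 6.9, sd = 0.8))

# Same idea as plot_hist_fre_poly(), but the column name is looked up
# with the .data pronoun instead of aes_string().
plot_hist_fre_poly2 <- function(data, x_str, bin_width) {
  ggplot(data, aes(x = .data[[x_str]])) +
    geom_histogram(binwidth = bin_width, fill = "#3366FF") +
    geom_freqpoly(binwidth = bin_width, color = "red") +
    xlab(x_str)
}

p <- plot_hist_fre_poly2(toy, "fixed.acidity", bin_width = 0.3)
```

Passing the data frame as an argument, rather than reading the global `ww`, also makes the helper reusable on subsets.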
Fixed acidity
plot_hist_fre_poly(x_str = "fixed.acidity", bin_width = 0.3,
xmin = 4, xmax = 10, dx = 1,
ymin = 0, ymax = 900, dy = 100)

Normal-like distribution
Volatile acidity
plot_hist_fre_poly(x_str = "volatile.acidity", bin_width = 0.01,
xmin = 0.1, xmax = 0.5, dx = 0.1,
ymin = 0, ymax = 300, dy = 50)

Normal-like distribution
Citric acid
plot_hist_fre_poly(x_str = "citric.acid", bin_width = 0.02,
xmin = 0.1, xmax = 0.7, dx = 0.1,
ymin = 0, ymax = 550, dy = 50)

Long-tailed normal-like distribution
Residual sugar
plot_hist_fre_poly(x_str = "residual.sugar", bin_width = 0.0984,
xmin = 0, xmax = 20, dx = 2,
ymin = 0, ymax = 60, dy = 10)

Non-normal distribution
Chlorides
plot_hist_fre_poly(x_str = "chlorides", bin_width = 0.003,
xmin = 0, xmax = 0.1, dx = 0.02,
ymin = 0, ymax = 550, dy = 50)

Long-tailed normal-like distribution
Free sulfur dioxide
plot_hist_fre_poly(x_str = "free.sulfur.dioxide", bin_width = 5,
xmin = 0, xmax = 100, dx = 20,
ymin = 0, ymax = 650, dy = 50)

Long-tailed normal-like distribution
Total sulfur dioxide
plot_hist_fre_poly(x_str = "total.sulfur.dioxide", bin_width = 10,
xmin = 0, xmax = 300, dx = 50,
ymin = 0, ymax = 550, dy = 50)

Long-tailed normal-like distribution
Mass density
plot_hist_fre_poly(x_str = "mass.density", bin_width = 0.0004,
xmin = 0.985, xmax = 1.005, dx = 0.005,
ymin = 0, ymax = 300, dy = 50)
Normal-like distribution
pH
plot_hist_fre_poly(x_str = "pH", bin_width = 0.03,
xmin = 2.8, xmax = 3.6, dx = 0.2,
ymin = 0, ymax = 450, dy = 50)

Normal-like distribution
Sulphates
plot_hist_fre_poly(x_str = "sulphates", bin_width = 0.02,
xmin = 0.2, xmax = 0.8, dx = 0.1,
ymin = 0, ymax = 400, dy = 50)

Normal-like distribution
Alcohol
plot_hist_fre_poly(x_str = "alcohol", bin_width = 0.1,
xmin = 8, xmax = 14, dx = 1,
ymin = 0, ymax = 250, dy = 50)

Non-normal distribution
Quality
plot_hist_fre_poly(x_str = "quality", bin_width = 1,
xmin = 3, xmax = 10, dx = 1,
ymin = 0, ymax = 2500, dy = 500)

Normal-like distribution
Statistical properties
summary(ww)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide mass.density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Quality ranges from 3 to 9. The median quality is 6.0, which is close to the mean of 5.878. About 75% of white wines are ranked between 3 and 6 (the 3rd quartile is 6) and about 50% are ranked between 5 and 6 (the 1st quartile is 5). Residual sugar ranges from 0.6 to 65.8 with a median of 5.2 and a mean of 6.391. The mean exceeds the median and the maximum lies far above the 3rd quartile (9.9), so its distribution has a long right tail. The minimum and maximum of free sulfur dioxide (SO2) are 2.0 and 289.0, respectively, with a median of 34.00. Thus the distribution of free SO2 has a huge dispersion and a very long tail. Similarly, the distributions of volatile acidity, citric acid, chlorides, and total sulfur dioxide have large dispersions and long tails. The statistical properties of all the other variables are also shown above.
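Because quality takes only integer values, the shares quoted from the quartiles can also be computed directly instead of being read off `summary()`. The sketch below uses a made-up vector `q`, not the wine data; with the real data the same expressions would be applied to `ww$quality`.

```r
# Hypothetical quality scores, not the wine data.
q <- c(3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8)

share_3_to_6 <- mean(q >= 3 & q <= 6)   # fraction of scores in [3, 6]
share_5_to_6 <- mean(q >= 5 & q <= 6)   # fraction of scores in [5, 6]

# Quartiles, as reported by summary().
quartiles <- quantile(q, probs = c(0.25, 0.5, 0.75))
```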
Observation from univariate plots and statistical properties
Distributions of the variables fall into three types: normal-like distribution, long-tailed normal-like distribution, and non-normal-like distribution.
Normal-like distributions are fixed acidity, volatile acidity, pH, sulphates, mass density, and quality.
Long-tailed normal-like distributions are citric acid, chlorides, free sulfur dioxide, and total sulfur dioxide. A long-tailed normal-like distribution can be brought closer to a normal-like one by removing outliers or by using a suitable scale (e.g., a log scale or a square root scale).
Non-normal-like distributions are residual sugar and alcohol. The main distribution of residual sugar ranges from 2 to 20. Two deep dips appear around 6 and 12, respectively. They divide the distribution into three parts: [0.6, 6), [6, 12), [12, 65.8], which represent low sugar, medium sugar and high sugar, respectively. The distribution of alcohol is a three-peak distribution in the range from 8 to 14.2. It is divided into three distributions in the range of [8.0, 9.5), [9.5, 11.5), and [11.5, 14.20], respectively. They represent low alcohol, medium alcohol, and high alcohol, respectively.
Unusual distributions and operations on the data
The distributions of residual sugar and alcohol are non-normal-like. In order to explore whether these distributions can be transformed to normal-like, a logarithm transformation is applied to variables residual.sugar and alcohol.
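The code for the logarithm transformation is not shown in this section. A minimal base-R sketch of the idea follows, run on a synthetic right-skewed variable `x` rather than on `ww$residual.sugar`; the `skewness()` helper is defined here only for illustration.

```r
# Hypothetical right-skewed variable standing in for residual.sugar.
set.seed(42)
x <- rlnorm(2000, meanlog = 1.5, sdlog = 0.8)

# Sample skewness: mean of the cubed standardized values
# (close to 0 for a symmetric sample).
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(x)          # skewness on the original scale
skew_log <- skewness(log10(x))   # skewness after the log10 transform

# Bin counts on the log10 scale, computed without drawing a plot.
h <- hist(log10(x), breaks = 30, plot = FALSE)
```

A skewness near zero after the transform does not by itself make a distribution normal-like; a bimodal variable stays bimodal on the log scale, so the histogram still has to be inspected.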
Creation of new categorical variables
Based on the observations above, I will create two new categorical variables, mass.density.level and alcohol.degree, for mass density and alcohol, respectively. Mass density is divided into three levels: low mass density [0.9871, 0.9920], medium mass density (0.9920, 0.9950], and high mass density (0.9950, 1.0390]. Alcohol is divided into three degrees: low alcohol [8.0, 9.5], medium alcohol (9.5, 11.5], and high alcohol (11.5, 14.2]. The intervals are closed on the right because cut() is called with its default right = TRUE. In addition, quality expressed as an integer from 0 to 10 may not be easy to relate to traditional quality grades. I will divide quality into three ranks: low quality [3, 5], medium quality (5, 7], and high quality (7, 9], and create a new categorical variable quality.rank for these ranks.
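Which level a value on a break point falls into depends on the `right` and `include.lowest` arguments of `cut()`. The sketch below uses made-up alcohol values that sit exactly on the breaks, with shortened labels, to compare the two conventions.

```r
# Made-up alcohol values that sit on or next to the break points.
x <- c(8.0, 9.5, 9.6, 11.5, 11.6, 14.2)
brks <- c(8, 9.5, 11.5, 14.2)
labs <- c("low", "medium", "high")

# Default right = TRUE: intervals are (a, b]; include.lowest adds 8.0.
right_closed <- cut(x, brks, labels = labs, include.lowest = TRUE)

# right = FALSE: intervals are [a, b); include.lowest then adds 14.2.
left_closed <- cut(x, brks, labels = labs, right = FALSE,
                   include.lowest = TRUE)

comparison <- data.frame(x, right_closed, left_closed)
```

With the default, a wine at exactly 9.5 is labelled low; with `right = FALSE` it is labelled medium.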
Creation of categorical variable mass.density.level
ww$mass.density.level <- cut(ww$mass.density, c(0.9871, 0.9920, 0.9950, 1.0390),
                             labels = c("low.mass.density",
                                        "medium.mass.density",
                                        "high.mass.density"),
                             include.lowest = TRUE)
summary(ww$mass.density.level)
## low.mass.density medium.mass.density high.mass.density
## 1446 1652 1800
In the dataset, the numbers of wines for low, medium and high mass density are close to each other.
Creation of categorical variable alcohol.degree
ww$alcohol.degree <- cut(ww$alcohol, c(8, 9.5, 11.5, 14.2),
                         labels = c("low.alcohol",
                                    "medium.alcohol",
                                    "high.alcohol"),
                         include.lowest = TRUE)
summary(ww$alcohol.degree)
## low.alcohol medium.alcohol high.alcohol
## 1436 2421 1041
In the dataset, the number of medium alcohol wines is much larger than that of low or high alcohol.
Creation of categorical variable quality.rank
ww$quality.rank <- cut(ww$quality, c(3, 5, 7, 9),
                       labels = c("low.quality",
                                  "medium.quality",
                                  "high.quality"),
                       include.lowest = TRUE)
summary(ww$quality.rank)
## low.quality medium.quality high.quality
## 1640 3078 180
In the dataset, the number of medium quality wines is much larger than that of low or high quality wines.
Univariate Analysis
Structure of dataset
str(ww)
## 'data.frame': 4898 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ mass.density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ mass.density.level : Factor w/ 3 levels "low.mass.density",..: 3 2 3 3 3 3 2 3 2 2 ...
## $ alcohol.degree : Factor w/ 3 levels "low.alcohol",..: 1 1 2 2 2 2 2 1 1 2 ...
## $ quality.rank : Factor w/ 3 levels "low.quality",..: 2 2 2 2 2 2 2 2 2 2 ...
There are 16 variables and 4898 observations. Variable 1 is an integer observation id; variables 2~12 are numerical input variables (based on physicochemical tests); variable 13 is an integer output variable (based on sensory data); and variables 14~16 are the factor variables created above. The variable names are given below [1].
1."X": Id of observations (integer variable)
2."fixed.acidity": fixed acidity (tartaric acid - g/dm^3)
3."volatile.acidity": volatile acidity (acetic acid - g/dm^3)
4."citric.acid": citric acid (g/dm^3)
5."residual.sugar": residual sugar (g/dm^3)
6."chlorides": chlorides (sodium chloride - g/dm^3)
7."free.sulfur.dioxide": free sulfur dioxide (mg/dm^3)
8."total.sulfur.dioxide": total sulfur dioxide (mg/dm^3)
9."mass.density": density (g/cm^3)
10."pH": pH value
11."sulphates": sulphates (potassium sulphate - g/dm^3)
12."alcohol": alcohol (% by volume)
13."quality": quality (integer variable scored between 0 and 10)
14."mass.density.level": level of mass density (factor variable)
15."alcohol.degree": alcohol degree of the wine (factor variable)
16."quality.rank": quality rank of the wine (factor variable)
Distributions of the numeric and integer variables fall into three types: normal-like distribution, long-tailed normal-like distribution, and non-normal distribution. About 75% of white wines are ranked between 3 and 6 and about 50% of white wines are ranked between 5 and 6.
Main feature(s) of interest
The main features in the dataset are mass density, alcohol, and residual sugar. I would like to explore which features are the most important to determine wine quality. Intuitively, wine quality is greatly impacted by chemical characteristics (such as alcohol and residual sugar as well as citric acid) but is insensitive to physical characteristics (such as mass density). I will examine which features will be more significant to wine quality.
Other features that may support the main feature(s) of interest
Apart from the main features, all the other features may influence wine quality, either through direct correlation with quality or through correlation with the main features. However, I assume that pH, citric acid, and total sulfur dioxide, as well as their combinations, are the features that contribute most to wine quality. I will first work out which ones matter most by exploring the correlations between the features and wine quality in the next section.
New variables created from existing variables
In order to connect my analysis to traditional awareness, three new categorical variables are created: mass.density.level, alcohol.degree, and quality.rank. The first two variables are derived from their distributions, and the third is created by dividing the range of quality uniformly into three subdivisions. I understand that variables created this way may not be fully consistent with professional classifications, but they are good enough for this project.
Unusual distributions and operations on the data
The original distributions of residual sugar and alcohol are non-normal and unusual. I log-transform them to try to obtain normal-like distributions. The transformed distribution of alcohol is close to a normal-like distribution. However, the transformed distribution of residual sugar is bimodal.
Bivariate Plots Section
Correlation analysis
Create a new dataframe without categorical variables
cor_vars <- names(ww) %in% c("X", "mass.density.level",
"alcohol.degree", "quality.rank")
ww_num <- ww[!cor_vars]
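Listing the columns to drop by name works, but the list has to be updated whenever a factor is added. As an alternative sketch, the numeric columns can be selected by type. The miniature data frame `df` below is hypothetical and only mimics the layout of `ww`.

```r
# Hypothetical miniature data frame: one id, two numeric columns, one factor.
df <- data.frame(
  X = 1:6,
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4),
  mass.density = c(1.001, 0.994, 0.995, 0.996, 0.993, 0.990),
  quality.rank = factor(c("low", "low", "medium", "medium", "high", "high"))
)

# Keep numeric columns only, then drop the id column by name.
is_num <- vapply(df, is.numeric, logical(1))
df_num <- df[, is_num & names(df) != "X", drop = FALSE]

# Base-R Pearson correlation matrix of the remaining columns.
ctab <- cor(df_num)
```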
Calculation of correlation coefficients
correlate(ww_num)
##
## CORRELATIONS
## ============
## - correlation type: pearson
## - correlations shown only when both variables are numeric
##
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity . -0.023 0.289
## volatile.acidity -0.023 . -0.149
## citric.acid 0.289 -0.149 .
## residual.sugar 0.089 0.064 0.094
## chlorides 0.023 0.071 0.114
## free.sulfur.dioxide -0.049 -0.097 0.094
## total.sulfur.dioxide 0.091 0.089 0.121
## mass.density 0.265 0.027 0.150
## pH -0.426 -0.032 -0.164
## sulphates -0.017 -0.036 0.062
## alcohol -0.121 0.068 -0.076
## quality -0.114 -0.195 -0.009
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.089 0.023 -0.049
## volatile.acidity 0.064 0.071 -0.097
## citric.acid 0.094 0.114 0.094
## residual.sugar . 0.089 0.299
## chlorides 0.089 . 0.101
## free.sulfur.dioxide 0.299 0.101 .
## total.sulfur.dioxide 0.401 0.199 0.616
## mass.density 0.839 0.257 0.294
## pH -0.194 -0.090 -0.001
## sulphates -0.027 0.017 0.059
## alcohol -0.451 -0.360 -0.250
## quality -0.098 -0.210 0.008
## total.sulfur.dioxide mass.density pH sulphates
## fixed.acidity 0.091 0.265 -0.426 -0.017
## volatile.acidity 0.089 0.027 -0.032 -0.036
## citric.acid 0.121 0.150 -0.164 0.062
## residual.sugar 0.401 0.839 -0.194 -0.027
## chlorides 0.199 0.257 -0.090 0.017
## free.sulfur.dioxide 0.616 0.294 -0.001 0.059
## total.sulfur.dioxide . 0.530 0.002 0.135
## mass.density 0.530 . -0.094 0.074
## pH 0.002 -0.094 . 0.156
## sulphates 0.135 0.074 0.156 .
## alcohol -0.449 -0.780 0.121 -0.017
## quality -0.175 -0.307 0.099 0.054
## alcohol quality
## fixed.acidity -0.121 -0.114
## volatile.acidity 0.068 -0.195
## citric.acid -0.076 -0.009
## residual.sugar -0.451 -0.098
## chlorides -0.360 -0.210
## free.sulfur.dioxide -0.250 0.008
## total.sulfur.dioxide -0.449 -0.175
## mass.density -0.780 -0.307
## pH 0.121 0.099
## sulphates -0.017 0.054
## alcohol . 0.436
## quality 0.436 .
Correlation Plots
In order to explore correlations of all numeric variables, I will plot a colorRamp plot of the correlations.
ctab <- cor(ww_num)
colorfun <- colorRamp(c("#CC0000", "white", "#3366CC"), space = "Lab")
plotcorr(ctab, mar = c(0, 0, 0, 0), col = rgb(colorfun((ctab+1)/2),
maxColorValue = 255))

Observations from correlation calculation and correlation plots
In the correlation plot, blue symbols represent positive correlation coefficients, while red symbols represent negative ones. The more a symbol deviates from a circle, the larger the magnitude of the correlation coefficient. The correlation coefficients clearly differ considerably between pairs of variables in the dataset. I introduce a correlation “order” to specify the correlation strength.
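The colour mapping behind the plot can be checked in isolation: a coefficient r in [-1, 1] is rescaled to (r + 1) / 2 in [0, 1] and passed to the ramp. The sketch below evaluates the ramp at three example coefficients chosen for illustration.

```r
# Red for r = -1, white for r = 0, blue for r = +1.
colorfun <- colorRamp(c("#CC0000", "white", "#3366CC"), space = "Lab")

r <- c(-1, 0, 1)                    # example correlation coefficients
rgb_vals <- colorfun((r + 1) / 2)   # one row of R, G, B values per coefficient
hex <- rgb(rgb_vals, maxColorValue = 255)
```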
1. First-order correlations
Let "r" = correlation coefficient. The 1st-order correlation is a strong correlation with abs(r) > 0.7. It includes the correlations of variable pairs below.
[density, residual sugar]: r = 0.839
[density, alcohol]: r = -0.780
2. Second-order correlations
The 2nd-order correlation is a medium correlation with 0.3 < abs(r) <= 0.7 . It includes the correlations of variable pairs below.
[quality, density]: r = -0.307
[quality, alcohol]: r = 0.436
[density, total sulfur dioxide]: r = 0.530
[alcohol, total sulfur dioxide]: r = -0.449
[alcohol, residual sugar]: r = -0.451
[alcohol, chlorides]: r = -0.360
[total sulfur dioxide, free sulfur dioxide]: r = 0.616
[residual sugar, total sulfur dioxide]: r = 0.401
[pH, fixed acidity]: r = -0.426
3. Third-order correlations
The 3rd-order correlation is a weak correlation with 0.2 < abs(r) <= 0.3 . It includes the correlations of variable pairs below.
[quality, chlorides]: r = -0.210
[density, free sulfur dioxide]: r = 0.294
[density, fixed acidity]: r = 0.265
[density, chlorides]: r = 0.257
[alcohol, free sulfur dioxide]: r = -0.250
[fixed acidity, citric acid]: r = 0.289
[residual sugar, free sulfur dioxide]: r = 0.299
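The three orders can be expressed as a small classifier. This is only a sketch: the function name `cor_order` and the short labels are introduced here for illustration, and the coefficients are copied from the correlation table.

```r
# Classify a correlation coefficient into the three orders.
# Coefficients with abs(r) <= 0.2 fall outside all classes and give NA.
cor_order <- function(r) {
  cut(abs(r),
      breaks = c(0.2, 0.3, 0.7, 1),
      labels = c("3rd", "2nd", "1st"),
      right = TRUE)
}

# Coefficients taken from the correlation table.
r <- c(density_sugar = 0.839, density_alcohol = -0.780,
       quality_alcohol = 0.436, quality_density = -0.307,
       quality_chlorides = -0.210, quality_sugar = -0.098)

orders <- setNames(as.character(cor_order(r)), names(r))
```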
My data analysis focus
The correlation analysis above shows that all the strong correlations involve density: density and alcohol (negative) as well as density and residual sugar (positive). Thus the physical characteristic density is one of the most important features of the dataset.
Up to the 2nd-order correlation, wine quality is correlated only with density and alcohol. This again shows that density is one of the most important features for quality.
Apart from the 1st-order correlation between density and alcohol, both density and alcohol are correlated to residual sugar (via the 1st-order correlation with density and via the 2nd-order correlation with alcohol) and total sulfur dioxide (via the 2nd-order correlation with both). In addition, alcohol and chlorides as well as residual sugar and total sulfur dioxide are correlated by the 2nd-order correlation.
The correlation tree is:
The 1st generation: quality
The 2nd generation: density and alcohol
The 3rd generation: residual sugar, total sulfur dioxide, and chlorides.
The scenario of my data analysis in this project is to investigate how the features (density, alcohol, residual sugar, total sulfur dioxide, and chlorides) individually impact wine quality as well as how the feature combinations influence wine quality.
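The generations of the correlation tree can be read off a correlation matrix by thresholding. The sketch below rebuilds a four-variable sub-matrix from the coefficients reported in the correlation table and uses the 2nd-order cut-off abs(r) > 0.3. The object names are illustrative.

```r
# Small correlation matrix rebuilt from the reported coefficients.
vars <- c("quality", "mass.density", "alcohol", "residual.sugar")
ctab <- matrix(c( 1.000, -0.307,  0.436, -0.098,
                 -0.307,  1.000, -0.780,  0.839,
                  0.436, -0.780,  1.000, -0.451,
                 -0.098,  0.839, -0.451,  1.000),
               nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Variables linked to quality at 2nd order or stronger (abs(r) > 0.3).
second_gen <- names(which(abs(ctab["quality", -1]) > 0.3))

# All pairs above the threshold, each listed once (upper triangle only).
idx <- which(abs(ctab) > 0.3 & upper.tri(ctab), arr.ind = TRUE)
pairs <- data.frame(var1 = vars[idx[, "row"]],
                    var2 = vars[idx[, "col"]],
                    r = ctab[idx])
```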
Scatter plots versus quality with linear fit as well as box plots by quality rank
In order to explore details of correlations between quality and main features, I will plot scatter plots, box plots, and linear fits of the main features vs. quality.
Function to create scatter plots and linear fit
# Jittered scatter plot of two columns of ww with a least-squares line (red).
# y_str, x_str: column names; (ymin, ymax, dy) and (xmin, xmax, dx): axis
# limits and tick spacing.
plot_scat_plot <- function(y_str, x_str, ymin, ymax, dy, xmin, xmax, dx)
{
  ggplot(aes_string(y = y_str, x = x_str), data = ww) +
    geom_jitter(alpha = 1/4, color = "#3366FF") +
    scale_y_continuous(limits = c(ymin, ymax), breaks = seq(ymin, ymax, dy)) +
    ggtitle("Scatter plot & linear fit") +
    stat_smooth(method = "lm", color = "red") +
    scale_x_continuous(limits = c(xmin, xmax), breaks = seq(xmin, xmax, dx))
}
Function to create box plots
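# Box plot of column y_str of ww grouped by the factor column x_str.
# coord_cartesian() zooms the y axis without dropping points from the statistics.
plot_box_plot <- function(y_str, x_str, ymin, ymax)
{
  ggplot(aes_string(y = y_str, x = x_str), data = ww) +
    geom_boxplot() +
    coord_cartesian(ylim = c(ymin, ymax)) +
    ggtitle("Box plot")
}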
Residual sugar versus quality
p1 <- plot_scat_plot(y_str = "residual.sugar", x_str = "quality",
ymin = 0, ymax = 20, dy = 5,
xmin = 3, xmax = 9, dx = 2)
p2 <- plot_box_plot(y_str= "residual.sugar", x_str = "quality.rank",
ymin = 0, ymax = 20)
grid.arrange(p1, p2, ncol = 2)

by(ww$residual.sugar, ww$quality.rank, summary)
## ww$quality.rank: low.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 6.625 7.054 11.020 23.500
## --------------------------------------------------------
## ww$quality.rank: medium.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 4.800 6.083 9.200 65.800
## --------------------------------------------------------
## ww$quality.rank: high.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.075 4.300 5.628 8.150 14.800
Correlation between residual sugar and quality is negative overall. The medians and the 3rd quartiles decrease with wine quality but the 1st quartiles increase with wine quality. Approximately the residual sugar decreases with wine quality. Thus higher quality rank contains a little less residual sugar.
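Whether the quartiles rise or fall across ranks can also be checked programmatically rather than by reading the `by()` output. A sketch on made-up values follows; `sugar` and `rank` are hypothetical stand-ins for `ww$residual.sugar` and `ww$quality.rank`.

```r
# Hypothetical values and ranks, not the wine data.
sugar <- c(1.2, 6.0, 11.0, 14.5, 1.5, 4.8, 9.0, 12.0, 2.0, 4.3, 8.0, 9.5)
rank <- factor(rep(c("low", "medium", "high"), each = 4),
               levels = c("low", "medium", "high"))

# One row per rank, one column per quartile.
q_by_rank <- t(sapply(split(sugar, rank), quantile,
                      probs = c(0.25, 0.5, 0.75)))

# TRUE when the medians fall monotonically from low to high rank.
medians_decrease <- all(diff(q_by_rank[, "50%"]) < 0)
```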
Chlorides versus quality
p1 <- plot_scat_plot(y_str = "chlorides", x_str = "quality",
ymin = 0.015, ymax = 0.075, dy = 0.005,
xmin = 3, xmax = 9, dx = 2)
p2 <- plot_box_plot(y_str= "chlorides", x_str = "quality.rank",
ymin = 0.015, ymax = 0.075)
grid.arrange(p1, p2, ncol = 2)

by(ww$chlorides, ww$quality.rank, summary)
## ww$quality.rank: low.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05144 0.05300 0.34600
## --------------------------------------------------------
## ww$quality.rank: medium.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03400 0.04100 0.04321 0.04800 0.25500
## --------------------------------------------------------
## ww$quality.rank: high.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100
Correlation between chlorides and quality is negative but weak: quality is correlated directly to chlorides only by the 3rd-order correlation with r = -0.210, and the linear fit follows this downward trend. The medians and quartiles decrease with quality rank. Thus higher quality ranks contain smaller amounts of chlorides.
Total sulfur dioxide versus quality
p1 <- plot_scat_plot(y_str = "total.sulfur.dioxide", x_str = "quality",
ymin = 0, ymax = 300, dy = 50,
xmin = 3, xmax = 9, dx = 2)
p2 <- plot_box_plot(y_str= "total.sulfur.dioxide", x_str = "quality.rank",
ymin = 0, ymax = 300)
grid.arrange(p1, p2, ncol = 2)

by(ww$total.sulfur.dioxide, ww$quality.rank, summary)
## ww$quality.rank: low.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 117.0 149.0 148.6 182.0 440.0
## --------------------------------------------------------
## ww$quality.rank: medium.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 105.0 129.0 133.6 159.0 294.0
## --------------------------------------------------------
## ww$quality.rank: high.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.0 102.8 122.0 125.9 148.5 212.5
Correlation between total sulfur dioxide and quality is negative. The medians and quartiles decrease with wine quality. Thus higher quality rank contains less total sulfur dioxide.
Mass density versus quality
p1 <- plot_scat_plot(y_str = "mass.density", x_str = "quality",
ymin = 0.986, ymax = 1.004, dy = 0.002,
xmin = 3, xmax = 9, dx = 2)
p2 <- plot_box_plot(y_str= "mass.density", x_str = "quality.rank",
ymin = 0.986, ymax = 1.004)
grid.arrange(p1, p2, ncol = 2)

by(ww$mass.density, ww$quality.rank, summary)
## ww$quality.rank: low.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9932 0.9951 0.9952 0.9971 1.0020
## --------------------------------------------------------
## ww$quality.rank: medium.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9912 0.9930 0.9935 0.9955 1.0390
## --------------------------------------------------------
## ww$quality.rank: high.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
Correlation between mass density and quality is negative and nearly linear. The medians and quartiles decrease with wine quality. Thus the mass density of higher quality rank is smaller.
Alcohol versus quality
p1 <- plot_scat_plot(y_str = "alcohol", x_str = "quality",
ymin = 8, ymax = 15, dy = 1,
xmin = 3, xmax = 9, dx = 2)
p2 <- plot_box_plot(y_str= "alcohol", x_str = "quality.rank",
ymin = 8, ymax = 15)
grid.arrange(p1, p2, ncol = 2)

by(ww$alcohol, ww$quality.rank, summary)
## ww$quality.rank: low.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.20 9.60 9.85 10.40 13.60
## --------------------------------------------------------
## ww$quality.rank: medium.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.8 10.8 10.8 11.8 14.2
## --------------------------------------------------------
## ww$quality.rank: high.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.65 12.60 14.00
Correlation between alcohol and quality is positive and nearly linear. The medians and quartiles of alcohol increase with quality rank. Thus higher quality ranks contain more alcohol and lower quality ranks contain less alcohol.
Observations from the plots
Wine quality is positively correlated with alcohol and negatively correlated with chlorides, total sulfur dioxide, mass density, and residual sugar. Thus wines of higher quality rank tend to have lower mass density, higher alcohol content, and lower amounts of chlorides, total sulfur dioxide, and residual sugar.
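The red lines in the scatter plots are least-squares fits, and in a simple linear regression the slope is tied to the Pearson coefficient by slope = r * sd(y) / sd(x), so both always share the same sign. The sketch below checks this on synthetic data with a built-in positive trend; the numbers are not taken from the wine data.

```r
# Synthetic data with a built-in positive trend; not the wine data.
set.seed(7)
quality <- sample(3:9, 300, replace = TRUE)
alcohol <- 9 + 0.4 * quality + rnorm(300, sd = 1)

r <- cor(quality, alcohol)                 # Pearson correlation
fit <- lm(alcohol ~ quality)               # least-squares line
slope <- unname(coef(fit)["quality"])

# For simple linear regression: slope = r * sd(y) / sd(x).
slope_from_r <- r * sd(alcohol) / sd(quality)
```

A nonzero slope describes an average trend only. With weak coefficients such as those for chlorides or residual sugar, individual wines scatter widely around the line.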
Scatter plots versus alcohol with linear fit as well as box plots by alcohol degree
In order to explore details of correlations between alcohol and other main features, I will plot scatter plots, box plots, and linear fits of the main features vs. alcohol.
Residual sugar versus alcohol
p1 <- plot_scat_plot(y_str = "residual.sugar", x_str = "alcohol",
ymin = 0, ymax = 20, dy = 5,
xmin = 8, xmax = 14.2, dx = 2)
p2 <- plot_box_plot(y_str= "residual.sugar", x_str = "alcohol.degree",
ymin = 0, ymax = 20)
grid.arrange(p1, p2, ncol = 2)

by(ww$residual.sugar, ww$alcohol.degree, summary)
## ww$alcohol.degree: low.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 6.375 10.600 9.979 14.200 31.600
## --------------------------------------------------------
## ww$alcohol.degree: medium.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.500 4.200 5.256 8.000 26.050
## --------------------------------------------------------
## ww$alcohol.degree: high.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 2.800 4.083 5.200 65.800
Correlation between residual sugar and alcohol is negative. The medians and 3rd quartiles decrease with alcohol degree, while the 1st quartile drops sharply from low to medium alcohol and then stays roughly level (1.5 vs. 1.7). In general, lower alcohol wines contain more residual sugar and higher alcohol wines contain less.
Chlorides versus alcohol
p1 <- plot_scat_plot(y_str = "chlorides", x_str = "alcohol",
ymin = 0.01, ymax = 0.07, dy = 0.01,
xmin = 8, xmax = 14.2, dx = 2)
p2 <- plot_box_plot(y_str= "chlorides", x_str = "alcohol.degree",
ymin = 0.01, ymax = 0.07)
grid.arrange(p1, p2, ncol = 2)

by(ww$chlorides, ww$alcohol.degree, summary)
## ww$alcohol.degree: low.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02800 0.04400 0.04900 0.05628 0.05600 0.30100
## --------------------------------------------------------
## ww$alcohol.degree: medium.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04425 0.04900 0.34600
## --------------------------------------------------------
## ww$alcohol.degree: high.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.02900 0.03400 0.03482 0.03800 0.16000
Correlation between chlorides and alcohol is negative and nearly linear. The medians and quartiles decrease approximately linearly with wine alcohol. Thus higher alcohol wines contain smaller amount of chlorides.
Total sulfur dioxide versus alcohol
p1 <- plot_scat_plot(y_str = "total.sulfur.dioxide", x_str = "alcohol",
ymin = 50, ymax = 250, dy = 50,
xmin = 8, xmax = 14.2, dx = 2)
p2 <- plot_box_plot(y_str= "total.sulfur.dioxide", x_str = "alcohol.degree",
ymin = 50, ymax = 250)
grid.arrange(p1, p2, ncol = 2)

by(ww$total.sulfur.dioxide, ww$alcohol.degree, summary)
## ww$alcohol.degree: low.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.0 135.0 165.0 163.4 191.0 344.0
## --------------------------------------------------------
## ww$alcohol.degree: medium.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 107.0 131.0 134.2 160.0 440.0
## --------------------------------------------------------
## ww$alcohol.degree: high.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 93.0 111.0 113.6 131.0 294.0
Correlation between total sulfur dioxide and alcohol is negative and nearly linear. The medians and quartiles decrease approximately linearly with alcohol. Lower alcohol wines contain more total sulfur dioxide and higher alcohol wines contain less.
Mass density versus alcohol
p1 <- plot_scat_plot(y_str = "mass.density", x_str = "alcohol",
ymin = 0.987, ymax = 1.003, dy = 0.002,
xmin = 8, xmax = 14.2, dx = 2)
p2 <- plot_box_plot(y_str= "mass.density", x_str = "alcohol.degree",
ymin = 0.987, ymax = 1.003)
grid.arrange(p1, p2, ncol = 2)

by(ww$mass.density, ww$alcohol.degree, summary)
## ww$alcohol.degree: low.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9919 0.9954 0.9970 0.9968 0.9984 1.0100
## --------------------------------------------------------
## ww$alcohol.degree: medium.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9894 0.9921 0.9934 0.9937 0.9951 1.0030
## --------------------------------------------------------
## ww$alcohol.degree: high.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9897 0.9906 0.9908 0.9916 1.0390
Correlation between mass density and alcohol is negative and linear. The median and quartiles decrease linearly with alcohol. The mass density of lower alcohol wine is higher and the mass density of higher alcohol wine is lower.
Quality versus alcohol
p1 <- plot_scat_plot(y_str = "quality", x_str = "alcohol",
ymin = 3, ymax = 9, dy = 1,
xmin = 8, xmax = 14.2, dx = 2)
p2 <- plot_box_plot(y_str= "quality", x_str = "alcohol.degree",
ymin = 3, ymax = 9)
grid.arrange(p1, p2, ncol = 2)

by(ww$quality, ww$alcohol.degree, summary)
## ww$alcohol.degree: low.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 5.000 5.493 6.000 8.000
## --------------------------------------------------------
## ww$alcohol.degree: medium.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.836 6.000 9.000
## --------------------------------------------------------
## ww$alcohol.degree: high.alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 6.000 6.000 6.505 7.000 9.000
Correlation between alcohol and quality is positive. The quality increases with alcohol overall. Thus higher quality ranks contain higher alcohol.
Observations from the plots
Alcohol is negatively correlated with residual sugar, chlorides, total sulfur dioxide, and mass density, and positively correlated with quality. Thus higher alcohol wines tend to contain lower amounts of residual sugar, chlorides, and total sulfur dioxide, and to have lower mass density. Such wines tend to have a higher quality rank.
Scatter plots versus mass density with linear fit as well as box plots by mass density level
In order to explore details of correlations between mass density and other main features, I will plot scatter plots, box plots, and linear fits of the main features vs. mass density.
Residual sugar versus mass density
p1 <- plot_scat_plot(y_str = "residual.sugar", x_str = "mass.density",
ymin = 0, ymax = 20, dy = 5,
xmin = 0.987, xmax = 1.002, dx = 0.002)
p2 <- plot_box_plot(y_str= "residual.sugar", x_str = "mass.density.level", ymin = 0, ymax = 20)
grid.arrange(p1, p2, ncol = 2)

by(ww$residual.sugar, ww$mass.density.level, summary)
## ww$mass.density.level: low.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 1.900 2.577 3.400 10.800
## --------------------------------------------------------
## ww$mass.density.level: medium.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.500 3.500 4.303 6.500 15.500
## --------------------------------------------------------
## ww$mass.density.level: high.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.30 8.20 11.45 11.37 14.20 65.80
Correlation between residual sugar and mass density is positive and nearly linear. The medians and quartiles increase with mass density. Thus higher mass density wine contains higher residual sugar.
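"Positive and nearly linear" can also be checked numerically through the slope and R^2 of a linear fit. The sketch below uses synthetic data generated with an assumed linear relation plus noise, not the wine dataset, so the numbers it prints say nothing about the real wines:

```r
# Synthetic stand-in for ww (assumption: made-up linear relation plus noise).
set.seed(1)
n <- 200
mass.density <- runif(n, 0.987, 1.002)
residual.sugar <- 1 + 1000 * (mass.density - 0.987) + rnorm(n, sd = 2)
toy <- data.frame(mass.density, residual.sugar)

# Ordinary least-squares fit of residual sugar on mass density
fit <- lm(residual.sugar ~ mass.density, data = toy)

slope <- coef(fit)[["mass.density"]]   # sign gives the direction of the relation
r2 <- summary(fit)$r.squared           # share of variance explained by the line

print(c(slope = slope, r.squared = r2))
```

On the real data the same two calls would take `data = ww`; an R^2 well below 1 would suggest that "nearly linear" is too strong a description.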
Chlorides versus mass density
p1 <- plot_scat_plot(y_str = "chlorides", x_str = "mass.density",
ymin = 0.01, ymax = 0.07, dy = 0.01,
xmin = 0.987, xmax = 1.002, dx = 0.002)
p2 <- plot_box_plot(y_str= "chlorides", x_str = "mass.density.level", ymin = 0.01, ymax = 0.07)
grid.arrange(p1, p2, ncol = 2)

by(ww$chlorides, ww$mass.density.level, summary)
## ww$mass.density.level: low.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03000 0.03500 0.03644 0.04100 0.16700
## --------------------------------------------------------
## ww$mass.density.level: medium.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01300 0.03600 0.04300 0.04827 0.05000 0.27100
## --------------------------------------------------------
## ww$mass.density.level: high.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.04200 0.04800 0.05098 0.05400 0.34600
Correlation between chlorides and mass density is positive and nearly linear. Chlorides increase with mass density. Thus lower mass density wines contain lower chlorides and higher mass density wines contain higher chlorides.
Total sulfur dioxide versus mass density
p1 <- plot_scat_plot(y_str = "total.sulfur.dioxide", x_str = "mass.density",
ymin = 50, ymax = 250, dy = 50,
xmin = 0.987, xmax = 1.002, dx = 0.002)
p2 <- plot_box_plot(y_str= "total.sulfur.dioxide", x_str = "mass.density.level", ymin = 50, ymax = 250)
grid.arrange(p1, p2, ncol = 2)

by(ww$total.sulfur.dioxide, ww$mass.density.level, summary)
## ww$mass.density.level: low.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 92.0 111.0 111.8 130.0 294.0
## --------------------------------------------------------
## ww$mass.density.level: medium.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 107.0 128.0 130.8 154.0 440.0
## --------------------------------------------------------
## ww$mass.density.level: high.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41.0 140.0 167.0 166.7 193.0 366.5
Correlation between total sulfur dioxide and mass density is positive and nearly linear. Total sulfur dioxide increases with mass density. Thus lower mass density wines contain lower total sulfur dioxide and higher mass density wines contain higher total sulfur dioxide.
Alcohol versus mass density
p1 <- plot_scat_plot(y_str = "alcohol", x_str = "mass.density",
ymin = 8, ymax = 15, dy = 1,
xmin = 0.987, xmax = 1.002, dx = 0.002)
p2 <- plot_box_plot(y_str= "alcohol", x_str = "mass.density.level", ymin = 8, ymax = 15)
grid.arrange(p1, p2, ncol = 2)

by(ww$alcohol, ww$mass.density.level, summary)
## ww$mass.density.level: low.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 11.20 11.90 11.88 12.50 14.20
## --------------------------------------------------------
## ww$mass.density.level: medium.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.0 9.8 10.4 10.4 10.9 13.5
## --------------------------------------------------------
## ww$mass.density.level: high.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.000 9.400 9.516 9.800 12.800
Correlation between alcohol and mass density is negative and linear. Alcohol decreases linearly with mass density. Thus lower mass density wines contain higher alcohol and higher mass density wines contain lower alcohol.
Quality versus mass density
p1 <- plot_scat_plot(y_str = "quality", x_str = "mass.density",
ymin = 2, ymax = 9, dy = 1,
xmin = 0.987, xmax = 1.002, dx = 0.002)
p2 <- plot_box_plot(y_str= "quality", x_str = "mass.density.level", ymin = 2, ymax = 9)
grid.arrange(p1, p2, ncol = 2)

by(ww$quality, ww$mass.density.level, summary)
## ww$mass.density.level: low.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 6.000 6.000 6.331 7.000 9.000
## --------------------------------------------------------
## ww$mass.density.level: medium.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.788 6.000 8.000
## --------------------------------------------------------
## ww$mass.density.level: high.mass.density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.597 6.000 9.000
The correlation between quality and mass density is negative: mean quality falls from 6.33 at low mass density to 5.60 at high mass density. Thus wines with lower mass density tend to rank higher in quality, and wines with higher mass density tend to rank lower.
Observations from the plots
Mass density is positively correlated with residual sugar, chlorides, and total sulfur dioxide, and negatively correlated with alcohol and quality. Thus wines with higher mass density tend to contain more residual sugar, chlorides, and total sulfur dioxide, but less alcohol. Such wines tend to have a low quality rank.
Bivariate Analysis
Relationships of the feature(s) of interest with other features in the dataset
Wine quality increases with alcohol and decreases with mass density. Thus wines of a higher quality rank tend to have higher alcohol and lower mass density.
Alcohol decreases with residual sugar, chlorides, total sulfur dioxide, and mass density. Thus higher-alcohol wines have lower mass density and contain less residual sugar, chlorides, and total sulfur dioxide.
Mass density increases with residual sugar, chlorides, and total sulfur dioxide, and decreases with alcohol. Thus wines with higher mass density contain more residual sugar, chlorides, and total sulfur dioxide, but less alcohol.
Finally, wine quality decreases with chlorides, total sulfur dioxide, and residual sugar. Thus wines of a higher quality rank tend to have lower mass density, more alcohol, and less chlorides, total sulfur dioxide, and residual sugar.
Interesting relationships between other features (not the main feature(s) of interest)
Apart from the 1st-order and 2nd-order correlations of the main features considered in the analysis, some 2nd-order correlations among other features were not taken into account. These include total sulfur dioxide with free sulfur dioxide (r = 0.616) and pH with fixed acidity (r = -0.426). These features are either correlated with quality only via the 3rd-generation main features, or not correlated with quality within the 2nd order, so they are left out of this analysis.
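Pairs like these can be pulled out of a correlation matrix by filtering on the absolute value of r. The following is only a sketch: the data are synthetic, the column names merely mimic three of the wine variables, and the 0.4 cutoff is an arbitrary assumption rather than a threshold used in this analysis.

```r
# Synthetic data (assumption): total SO2 is built to depend on free SO2,
# pH is generated independently of both.
set.seed(42)
n <- 300
free.so2 <- rnorm(n, mean = 35, sd = 15)
toy <- data.frame(
  free.sulfur.dioxide  = free.so2,
  total.sulfur.dioxide = 100 + 1.5 * free.so2 + rnorm(n, sd = 30),
  pH                   = rnorm(n, mean = 3.2, sd = 0.15)
)

cm <- cor(toy)  # Pearson correlation matrix

# Look at the upper triangle only so each pair is listed once,
# and keep the pairs whose |r| reaches the (arbitrary) cutoff.
idx <- which(upper.tri(cm) & abs(cm) >= 0.4, arr.ind = TRUE)
strong <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                     var2 = colnames(cm)[idx[, "col"]],
                     r    = cm[idx])
print(strong)
```

Applied to `ww`, the same filter would need the non-numeric columns (the factor levels and ranks) removed before calling `cor()`.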
The strongest relationship
Mass density is strongly and positively correlated with residual sugar, and strongly and negatively correlated with alcohol. Wine quality is positively correlated with alcohol and negatively correlated with mass density. Through these two features, wine quality is also correlated with the other features.
Final Plots and Summary
Plot One
p1 <- plot_hist_by_color(x_str = "alcohol", by_str = "quality.rank",
bin_width = 0.1, xmin = 8, xmax = 15, dx = 1,
ymin = 0, ymax = 250, dy = 50) +
xlab("Alcohol (% volume)") + ylab("Count") +
ggtitle("Alcohol frequency by quality rank") +
theme(plot.title = element_text(size=18), legend.justification=c(1,0),
legend.position=c(1,0.75), axis.text = element_text(size = 12),
axis.title=element_text(size=15), legend.title=element_text(size=12),
legend.text = element_text(size = 12)) +
scale_fill_discrete(name="Quality rank") +
scale_color_discrete(name="Quality rank")
p2 <- plot_scat_by_color(y_str = "alcohol", x_str = "quality",
by_str = "quality.rank",
ymin = 8, ymax = 15, dy = 1) +
xlab("Quality") + ylab("Alcohol (% volume)") +
ggtitle("Alcohol vs quality by quality rank and linear fit") +
scale_colour_discrete(name = "Quality rank") +
scale_x_continuous(limits = c(3, 9), breaks = seq(3, 9, 2)) +
theme(plot.title = element_text(size=18), legend.justification=c(1,0),
legend.position=c(0.34,0.75), axis.text = element_text(size = 12),
axis.title=element_text(size=15), legend.title=element_text(size=12),
legend.text = element_text(size = 12))
p3 <- plot_density_by_color(x_str = "alcohol", by_str = "quality.rank") +
xlab("Alcohol (% volume)") + ylab("Density") +
ggtitle("Alcohol density by quality rank") +
scale_colour_discrete(name = "Quality rank") +
scale_x_continuous(limits = c(8, 15), breaks = seq(8, 15, 1)) +
theme(plot.title = element_text(size=18), legend.justification=c(1,0),
legend.position=c(1,0.75), axis.text = element_text(size = 12),
axis.title=element_text(size=15), legend.title=element_text(size=12),
legend.text = element_text(size = 12))
p4 <- plot_box_by_color(y_str = "alcohol", x_str = "quality.rank",
by_str = "quality.rank", ymin = 8, ymax = 15) +
xlab("Quality rank") + ylab("Alcohol (% volume)") +
ggtitle("Alcohol by quality rank") +
scale_fill_discrete(name = "Quality rank") +
scale_y_continuous(limits = c(8, 15), breaks = seq(8, 15, 1)) +
theme(plot.title = element_text(size=18), legend.justification=c(1,0),
legend.position=c(0.34,0.75), axis.text = element_text(size = 12),
axis.title=element_text(size=15), legend.title=element_text(size=12),
legend.text = element_text(size = 12))
grid.arrange(p1,p2,p3,p4,ncol = 2)

Description One
The alcohol distributions of the different quality ranks are well separated from each other. The correlation between alcohol and quality is positive and nearly linear, and the median alcohol increases with quality. Higher quality ranks contain more alcohol and lower quality ranks contain less, which can be seen clearly from the statistical insights given below.
Statistical insights
by(ww$alcohol.degree, ww$quality.rank, summary)
## ww$quality.rank: low.quality
## low.alcohol medium.alcohol high.alcohol
## 798 768 74
## --------------------------------------------------------
## ww$quality.rank: medium.quality
## low.alcohol medium.alcohol high.alcohol
## 620 1601 857
## --------------------------------------------------------
## ww$quality.rank: high.quality
## low.alcohol medium.alcohol high.alcohol
## 18 52 110
by(ww$alcohol, ww$quality.rank, summary)
## ww$quality.rank: low.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.20 9.60 9.85 10.40 13.60
## --------------------------------------------------------
## ww$quality.rank: medium.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.8 10.8 10.8 11.8 14.2
## --------------------------------------------------------
## ww$quality.rank: high.quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.65 12.60 14.00
The most common alcohol content is 9.4% by volume for low quality wines, 11% for medium quality wines, and 12.5% for high quality wines. These values are very close to the medians: 9.6%, 10.8%, and 12.0% alcohol by volume for low, medium, and high quality wines. This is a reasonable and consistent result.
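The "most common" values above read as histogram modes. A minimal sketch of how a per-rank mode could be computed next to the median, assuming values are binned to the 0.1 bin width used in Plot One; `modal_value` is a hypothetical helper and the data are made up for illustration:

```r
# Hypothetical helper: most frequent value of x after snapping to a bin width.
modal_value <- function(x, bin_width = 0.1) {
  binned <- round(x / bin_width) * bin_width
  counts <- table(binned)
  as.numeric(names(counts)[which.max(counts)])
}

# Made-up values (assumption), not taken from ww.
toy <- data.frame(
  alcohol = c(9.4, 9.4, 9.6, 10.1, 11.0, 11.0, 10.8, 12.5, 12.5, 12.0),
  quality.rank = factor(c(rep("low.quality", 4),
                          rep("medium.quality", 3),
                          rep("high.quality", 3)),
                        levels = c("low.quality", "medium.quality", "high.quality"))
)

# Mode and median of alcohol within each quality rank
modes   <- tapply(toy$alcohol, toy$quality.rank, modal_value)
medians <- tapply(toy$alcohol, toy$quality.rank, median)
print(rbind(mode = modes, median = medians))
```

When several bins tie for the highest count, `which.max` returns the first one, so the reported mode is the lowest of the tied values.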
Plot Two
p1 <- plot_scat_multi_var_by_color(y_str = "mass.density", x_str = "alcohol",
by_str = "quality.rank",
xmin = 8, xmax = 14, dx = 1,
ymin = 0.985, ymax = 1.005, dy = 0.005) +
ylab("Mass density (g/cm^3)") + xlab("Alcohol (% volume)") +
ggtitle("Mass density vs alcohol by quality rank") +
theme(plot.title = element_text(size=18),legend.justification=c(1,0),
legend.position=c(1,0.75), axis.text = element_text(size = 12),
axis.title=element_text(size=15), legend.title=element_text(size=12),
legend.text = element_text(size = 12))
p2 <- plot_box_by_color(y_str = "mass.density", x_str = "alcohol.degree",
by_str = "quality.rank",
ymin = 0.985, ymax = 1.005) +
xlab("Alcohol degree") + ylab("Mass density (g/cm^3)") +
ggtitle("Mass density vs alcohol degree by quality rank") +
theme(plot.title = element_text(size=18),legend.justification=c(1,0),
legend.position=c(1,0.75), axis.text = element_text(size = 12),
axis.title=element_text(size=15), legend.title=element_text(size=12),
legend.text = element_text(size = 12)) +
scale_fill_discrete(name = "Quality rank")
grid.arrange(p1, p2, ncol = 2)

Description Two
Lower quality ranks have higher mass density and lower alcohol, while higher quality ranks have lower mass density and higher alcohol.
In the medium and high alcohol degrees, mass density decreases with quality, so higher quality ranks have lower mass density. In the low alcohol degree, however, mass density increases with quality, so higher quality ranks have higher mass density. This reversal is a very interesting finding.
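A reversal of this kind can be read off a two-way table of medians. The sketch below uses a tiny made-up data frame whose values were chosen to imitate the pattern described; it is an illustration of the technique, not a reproduction of the wine data:

```r
# Made-up values (assumption) arranged to imitate the reversal described above.
toy <- data.frame(
  mass.density = c(0.9950, 0.9960, 0.9970, 0.9980,   # low alcohol
                   0.9930, 0.9940, 0.9910, 0.9920),  # high alcohol
  alcohol.degree = factor(rep(c("low.alcohol", "high.alcohol"), each = 4),
                          levels = c("low.alcohol", "high.alcohol")),
  quality.rank = factor(rep(rep(c("low.quality", "high.quality"), each = 2), times = 2),
                        levels = c("low.quality", "high.quality"))
)

# Median mass density for every alcohol degree x quality rank cell
med <- with(toy, tapply(mass.density,
                        list(alcohol.degree, quality.rank),
                        median))
print(med)

# Positive difference: density rises with quality; negative: it falls.
print(med[, "high.quality"] - med[, "low.quality"])
```

With `ww` in place of `toy`, a change of sign between the rows of the last vector would confirm the reversal seen in the box plot.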
Statistical insights
by(ww$mass.density.level, ww$alcohol.degree, summary)
## ww$alcohol.degree: low.alcohol
## low.mass.density medium.mass.density high.mass.density
## 2 288 1146
## --------------------------------------------------------
## ww$alcohol.degree: medium.alcohol
## low.mass.density medium.mass.density high.mass.density
## 576 1210 635
## --------------------------------------------------------
## ww$alcohol.degree: high.alcohol
## low.mass.density medium.mass.density high.mass.density
## 868 154 19
Most high quality wines contain high alcohol (~12.5% by volume) and have a low mass density of about 0.990 g/cm^3. Most low quality wines contain low alcohol (~9.4% by volume) and have a high mass density of about 0.995 g/cm^3. Most medium quality wines contain about 11% alcohol by volume and have a mass density of about 0.992 g/cm^3, values that lie between those of the low and high quality wines.
Plot Three
p1 <- plot_scat_multi_var_cross(y_str = "residual.sugar", x_str = "alcohol",
by_str = "mass.density.level",
ymin = 0, ymax = 22, dy = 5) +
xlab("Alcohol (% volume)") + ylab("Residual sugar (g/dm^3)") +
ggtitle("Residual sugar vs alcohol by quality rank and mass density level") +
scale_colour_discrete(name = "Mass density level") +
theme(plot.title = element_text(size = 18), legend.justification = c(1,0),
legend.position = c(1, 0.68), axis.text = element_text(size = 12),
axis.title=element_text(size=15), legend.title=element_text(size=12),
legend.text = element_text(size = 12))
p2 <- plot_box_multi_var_cross(y_str = "residual.sugar",
x_str = "alcohol.degree",
by_str = "mass.density.level",
ymin = 0, ymax = 22) +
xlab("Alcohol degree") + ylab("Residual sugar (g/dm^3)") +
ggtitle("Residual sugar vs alcohol degree by quality rank and mass density level") +
scale_fill_discrete(name = "Mass density level") +
theme(plot.title = element_text(size = 18),legend.justification = c(1,0),
legend.position = c(1, 0.68), axis.text = element_text(size = 12),
axis.title=element_text(size=15), legend.title=element_text(size=12),
legend.text = element_text(size = 12))
grid.arrange(p1, p2, ncol = 1)

Description Three
In all quality ranks, low mass density almost always corresponds to high alcohol and low residual sugar, while high mass density almost always corresponds to low alcohol and high residual sugar. In the low quality rank, wines with high residual sugar, low alcohol, and high mass density appear to outnumber the other wines. In the medium quality rank, the number of low mass density wines is quite close to the numbers of medium and high mass density wines. In the high quality rank, wines with low mass density, and thus high alcohol and low residual sugar, are somewhat more numerous than the others.
In all quality ranks and for all alcohol degrees, residual sugar increases monotonically with mass density.
Statistical insights
by(ww$mass.density.level, ww$quality.rank, summary)
## ww$quality.rank: low.quality
## low.mass.density medium.mass.density high.mass.density
## 189 606 845
## --------------------------------------------------------
## ww$quality.rank: medium.quality
## low.mass.density medium.mass.density high.mass.density
## 1151 998 929
## --------------------------------------------------------
## ww$quality.rank: high.quality
## low.mass.density medium.mass.density high.mass.density
## 106 48 26
by(ww$alcohol.degree, ww$mass.density.level, summary)
## ww$mass.density.level: low.mass.density
## low.alcohol medium.alcohol high.alcohol
## 2 576 868
## --------------------------------------------------------
## ww$mass.density.level: medium.mass.density
## low.alcohol medium.alcohol high.alcohol
## 288 1210 154
## --------------------------------------------------------
## ww$mass.density.level: high.mass.density
## low.alcohol medium.alcohol high.alcohol
## 1146 635 19
Most low quality wines have high mass density (>=0.9950 g/cm^3) and contain low alcohol (~9.4% by volume) and high residual sugar (>=7.5 g/dm^3), while most high quality wines have low mass density (<0.9920 g/cm^3) and contain high alcohol (~12.5% by volume) and low residual sugar (<5 g/dm^3). Among medium quality wines, when only mass density is considered, low mass density wines are the largest group. However, as already shown, most medium quality wines contain medium alcohol (~11% by volume). When alcohol and mass density are considered together, most medium quality wines have medium mass density (>=0.9920 g/cm^3 and <0.9950 g/cm^3) and contain medium alcohol and medium residual sugar (>=5 g/dm^3 and <7.5 g/dm^3).
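How large "most" is can be read from row shares of the quality rank by mass density level counts printed above. The sketch below simply re-enters those printed counts by hand and normalizes each row; on the live data, `table(ww$quality.rank, ww$mass.density.level)` would presumably give the same matrix.

```r
# Counts copied from the by(ww$mass.density.level, ww$quality.rank, summary) output.
counts <- matrix(c( 189, 606, 845,
                   1151, 998, 929,
                    106,  48,  26),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(
                   quality.rank = c("low.quality", "medium.quality", "high.quality"),
                   mass.density.level = c("low.mass.density",
                                          "medium.mass.density",
                                          "high.mass.density")))

# Share of each mass density level within a quality rank (rows sum to 1)
shares <- prop.table(counts, margin = 1)
print(round(100 * shares, 1))

# Largest mass density level in each quality rank
largest <- colnames(shares)[apply(shares, 1, which.max)]
print(setNames(largest, rownames(shares)))
```

With these counts the largest level holds about 52% of the low quality wines (high mass density), about 37% of the medium quality wines (low mass density), and about 59% of the high quality wines (low mass density), so "most" is a clear majority only for the high quality rank.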
Reflection
The dataset I explored contains 13 variables and 4898 observations, and I created 3 new variables during the analysis. At the very beginning I was confused and upset, because I knew very little about wines and did not know where to begin or what to focus on. All I knew was that I had to explore the relationship between wine quality and the physicochemical characteristics (the features), so I started plotting quality against the variables aimlessly. After a couple of days I realized that analyzing data by just making plots could take a lot of time and might end up being a waste of time without any results, so I turned to statistical analysis of the data. I first computed the medians, means, and 1st and 3rd quartiles of every variable and made population plots for the variables. Then I explored the relationships between quality and the variables. The starting point of this exploration was to compute and check the correlation coefficients of every pair of variables, including quality. I grouped the correlations into different orders. From the correlation coefficients, I identified the features likely to have the most significant impact on wine quality: I isolated the features that correlate more strongly with quality directly, and the features that correlate strongly with those features. To make this clearer, I built a correlation tree from the correlation coefficients. The tree gave me a clear idea of what to focus on and where to start. I therefore proposed to investigate how wine quality changes with the main features (mass density, alcohol, residual sugar, total sulfur dioxide, and chlorides) and how the correlations between features influence wine quality.
After visual analysis of the relationships between the features and wine quality, I found and confirmed the following relations.
Wine quality increases with alcohol and decreases with mass density. Thus higher quality wines tend to have higher alcohol and lower mass density.
Alcohol decreases with residual sugar, chlorides, total sulfur dioxide, and mass density. Thus higher-alcohol wines have lower mass density and contain less residual sugar, chlorides, and total sulfur dioxide.
Mass density increases with residual sugar, chlorides, and total sulfur dioxide, and decreases with alcohol. Thus wines with higher mass density contain more residual sugar, chlorides, and total sulfur dioxide, but less alcohol.
Finally, wine quality decreases with chlorides, total sulfur dioxide, and residual sugar. Thus higher quality wines tend to have lower mass density, more alcohol, and less chlorides, total sulfur dioxide, and residual sugar.
Some semi-quantitative results:
Table 1. Alcohol, mass density and residual sugar for different quality wines
------------------------------------------------------------------------
Wine quality | Alcohol | Mass density | Residual sugar
| (% volume) | (g/cm^3) | (g/dm^3)
------------------------------------------------------------------------
Low quality | 8 ~ 9.5 | 0.9950 ~ 1.0390 | 7.5 ~ 65.8
Medium quality | 9.5 ~ 11.5 | 0.9920 ~ 0.9950 | 5.0 ~ 7.5
High quality | 11.5 ~ 14.2 | 0.9871 ~ 0.9920 | 0.6 ~ 5.0
------------------------------------------------------------------------
Two things surprised me during the investigation:
Wine quality depends so strongly on mass density, a physical characteristic.
The relations between wine quality and the features change when new features are introduced, because of the correlations among the features.
Because some unexplored features may have a large effect on wine quality, I propose to explore the relationships between wine quality and the features that have higher-order correlations with quality, which I did not examine in this analysis. Relative ratios of physicochemical components may also contribute significantly to wine quality, so exploring the relationship between wine quality and these ratios is another task for the future. In addition, I am interested in testing models, such as a linear model, to predict wine quality. Recent studies [1, 2] would be valuable references for comparison.
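As a starting point for that future work, a linear model could be fitted with `lm()` and scored on held-out rows. The sketch below is only an outline under stated assumptions: the data are synthetic, the choice of alcohol and mass density as predictors and the 400/100 split are arbitrary, and nothing here has been run on the wine dataset.

```r
# Synthetic stand-in for ww (assumption): quality is generated from alcohol plus noise.
set.seed(7)
n <- 500
alcohol <- runif(n, 8, 14)
mass.density <- 1.005 - 0.001 * alcohol + rnorm(n, sd = 0.001)
quality <- round(pmin(9, pmax(3, 1 + 0.45 * alcohol + rnorm(n, sd = 0.7))))
toy <- data.frame(quality, alcohol, mass.density)

# Simple split: first 400 rows for fitting, last 100 rows held out
train   <- toy[1:400, ]
holdout <- toy[401:500, ]

fit <- lm(quality ~ alcohol + mass.density, data = train)

# Predict the held-out wines and compute the root mean squared error
pred <- predict(fit, newdata = holdout)
rmse <- sqrt(mean((holdout$quality - pred)^2))

print(coef(fit))
print(rmse)
```

Because quality is an ordered integer score, a linear model treats it as continuous; comparing its held-out error against an ordinal or tree-based model would be a natural next step.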